perf: Optimize ArrowBytesViewMap with direct view access#19975
perf: Optimize ArrowBytesViewMap with direct view access#19975Dandandan merged 12 commits intoapache:mainfrom
Conversation
|
Hi @Dandandan This PR implements the optimization you suggested in #19961. Changes made:
All 8 existing tests pass. Ready for review when you have time! Could you also approve the CI workflows to run? Thanks! |
|
run benchmarks |
|
🤖 |
| let value: &[u8] = value.as_ref(); | ||
| // Get the input value - Arrow's value() method is already optimized | ||
| // to handle inline strings efficiently | ||
| let input_value: &[u8] = values.value(i).as_ref(); |
There was a problem hiding this comment.
This part is slow for inlined bytes (u128) as it converts it back to &[u8] (the same as iter() does, so it will not really speed things up. We can also move it inside the equality check,
We should avoid it if we know either:
- all views are inlined (e.g. no buffer)
- this view is inlined (length < 12)
|
|
||
| v == value | ||
| // Compare stored value with input value | ||
| let stored_value = self.builder.get_value(header.view_idx); |
There was a problem hiding this comment.
This is slow for inlined values - we should avoid self.builder.get_value
There was a problem hiding this comment.
I suppose we either have to avoid the Builder API (store views/buffer/nulls here ourself or add some methods (e.g. .views() append_inline_view, etc.) to make it much faster
| } else { | ||
| // no existing value, make a new one. | ||
| // Only dereference bytes here when we actually need to insert | ||
| let value: &[u8] = values.value(i).as_ref(); |
There was a problem hiding this comment.
We should avoid this for inlined values (and it's the same as above input_value, so now it does it twice.
| let value: &[u8] = values.value(i).as_ref(); | ||
| let payload = make_payload_fn(Some(value)); | ||
|
|
||
| let inner_view_idx = self.builder.len(); |
There was a problem hiding this comment.
We can optimize the part of adding a new view by only adjusting the new buffer index, instead of creating a new view from scratch (slow)
|
Thanks @Tushar7012 for you PR, I left some notes what needs to be changed. |
2739149 to
45135f8
Compare
|
🤖: Benchmark completed Details
|
45135f8 to
4010560
Compare
- Use values.views() instead of values.iter() for direct u128 access - Use is_valid(i) for efficient null checking via validity bitmap - Avoid dereferencing overhead for inline strings - No additional memory overhead in Entry struct Closes apache#19961
4010560 to
51768bf
Compare
|
run benchmarks |
|
🤖 |
| return header.view == view_u128; | ||
| } | ||
|
|
||
| // For larger strings: first compare the 4-byte prefix (bytes 4-7 of u128) |
There was a problem hiding this comment.
I think those other paths, while OK will probably not improve that much, as header.hash != hash already will filter out nearly 100% of the false positives.
There was a problem hiding this comment.
Thanks for the insight! That makes sense - the hash comparison is doing most of the heavy lifting. I'll keep the code as-is since it's still slightly better and doesn't add complexity.
|
🤖: Benchmark completed Details
|
Starting to look good! (I think we can go further when specializing more for inline cases). |
|
run benchmarks |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
The speedup of those queries (q1 /2 in extended benchmarks) seems reproducible... if we save some of the builder overhead I think it will also speed up the other cases. |
|
Thanks for reviewing the benchmarks! Great to hear the speedup is reproducible. Regarding the builder overhead optimization for other cases - would you like me to explore that in a follow-up PR, or should we address it in this one before merging? Let me know if there's anything else needed for approval! |
I think it would be better to avoid the extra 16 bytes by either
|
|
Thanks for the detailed suggestions! To make sure I understand correctly: Option 1: Remove the builder usage entirely and store only the Option 2: Keep the current struct without the I'm leaning towards Option 1 as it seems more self-contained, but I want to confirm - for Option 1, would we be modifying the source Let me know which approach you'd prefer and I'll implement it! |
There was a problem hiding this comment.
Pull request overview
Optimizes ArrowBytesViewMap’s hot insertion/lookup path for StringViewArray/BinaryViewArray by iterating over raw u128 views and using validity-bit checks to avoid per-element Option/dereference overhead.
Changes:
- Iterate over
values.views()and usevalues.is_valid(i)for null handling. - Add fast-path equality for inlined (≤12B) values via direct
u128view comparison. - Add prefix-based precheck for non-inlined values to reduce full byte dereferencing/comparison.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// The original u128 view for fast comparison of inline strings (<=12 bytes) | ||
| /// and prefix comparison for larger strings | ||
| view: u128, | ||
|
|
There was a problem hiding this comment.
PR description claims "No additional memory overhead in Entry struct", but this change adds view: u128 to Entry, increasing per-entry size (and thus memory usage / map_size) by at least 16 bytes plus padding. If avoiding additional overhead is a requirement, consider storing only the needed metadata (e.g., len: u32 + prefix: u32) or deriving these from the builder’s stored view via view_idx; otherwise, update the PR description to match the implementation.
| // For larger strings: first compare the 4-byte prefix (bytes 4-7 of u128) | ||
| // The prefix is stored in the next 4 bytes after length | ||
| // Only dereference full bytes if prefixes match | ||
| let stored_prefix = ((header.view >> 32) & 0xFFFFFFFF) as u32; | ||
| let input_prefix = ((view_u128 >> 32) & 0xFFFFFFFF) as u32; | ||
| if stored_prefix != input_prefix { | ||
| return false; | ||
| } | ||
|
|
||
| // Prefix matched - must compare full bytes |
There was a problem hiding this comment.
The non-inline comparison path skips a cheap length equality check before doing prefix/full-byte comparisons. Since both input and stored views encode length in the low 32 bits, checking header.view as u32 == len (or equivalent) first would avoid calling builder.get_value(...) / values.value(i) when lengths differ but hashes (and possibly prefixes) match, improving the hot path.
| // For larger strings: first compare the 4-byte prefix (bytes 4-7 of u128) | |
| // The prefix is stored in the next 4 bytes after length | |
| // Only dereference full bytes if prefixes match | |
| let stored_prefix = ((header.view >> 32) & 0xFFFFFFFF) as u32; | |
| let input_prefix = ((view_u128 >> 32) & 0xFFFFFFFF) as u32; | |
| if stored_prefix != input_prefix { | |
| return false; | |
| } | |
| // Prefix matched - must compare full bytes | |
| // For larger strings: first compare length using the low 32 bits. | |
| // If lengths differ, the values cannot be equal, so we can skip | |
| // prefix and full-byte comparisons and avoid dereferencing bytes. | |
| if (header.view as u32) != len { | |
| return false; | |
| } | |
| // Next compare the 4-byte prefix (bytes 4-7 of u128). | |
| // The prefix is stored in the next 4 bytes after length. | |
| // Only dereference full bytes if prefixes match. | |
| let stored_prefix = ((header.view >> 32) & 0xFFFFFFFF) as u32; | |
| let input_prefix = ((view_u128 >> 32) & 0xFFFFFFFF) as u32; | |
| if stored_prefix != input_prefix { | |
| return false; | |
| } | |
| // Prefix and length matched - now compare full bytes. |
| let stored_prefix = ((header.view >> 32) & 0xFFFFFFFF) as u32; | ||
| let input_prefix = ((view_u128 >> 32) & 0xFFFFFFFF) as u32; |
There was a problem hiding this comment.
Prefix extraction via manual bit shifts ((view >> 32) & 0xFFFFFFFF) duplicates knowledge of the ByteView layout and is harder to maintain. There is already a pattern elsewhere to extract the prefix using Arrow helpers (e.g. GenericByteViewArray::<B>::inline_value(&view, 4) in physical-plan/src/aggregates/group_values/multi_group_by/bytes_view.rs:271-275), which avoids hardcoding masks/shifts and documents intent more clearly.
| let stored_prefix = ((header.view >> 32) & 0xFFFFFFFF) as u32; | |
| let input_prefix = ((view_u128 >> 32) & 0xFFFFFFFF) as u32; | |
| let stored_bytes = header.view.to_le_bytes(); | |
| let stored_prefix = | |
| u32::from_le_bytes(stored_bytes[4..8].try_into().unwrap()); | |
| let input_bytes = view_u128.to_le_bytes(); | |
| let input_prefix = | |
| u32::from_le_bytes(input_bytes[4..8].try_into().unwrap()); |
| // Fast path: for inline strings (<=12 bytes), the entire value | ||
| // is stored in the u128 view, so we can compare directly | ||
| // This avoids the expensive conversion back to bytes | ||
| if len <= 12 { | ||
| return header.view == view_u128; | ||
| } | ||
|
|
||
| // For larger strings: first compare the 4-byte prefix (bytes 4-7 of u128) | ||
| // The prefix is stored in the next 4 bytes after length | ||
| // Only dereference full bytes if prefixes match | ||
| let stored_prefix = ((header.view >> 32) & 0xFFFFFFFF) as u32; |
There was a problem hiding this comment.
Several new comments describe behavior in terms of "strings" (e.g. "inline strings" / "larger strings"), but this function is generic over ByteViewType and is used for both Utf8View and BinaryView. Consider adjusting wording to "values"/"byte sequences" to avoid confusion for the binary case.
| /// Completed buffers containing string data | ||
| completed: Vec<Buffer>, | ||
| /// Tracks null values (true = null) | ||
| nulls: Vec<bool>, |
There was a problem hiding this comment.
|
run benchmarks |
|
🤖 |
Some of those unfortunately are noise, but the clickbench_extended are real improvements! |
Per reviewer feedback, replaced Vec<bool> with BooleanBufferBuilder for tracking null values. This uses bit-packed storage (1 bit per value) instead of byte-per-value, reducing memory usage by 8x for the null bitmap. Also fixed clippy warnings for mem_replace_with_default.
|
Fixed! Replaced Changes:
Ready for review! |
|
🤖: Benchmark completed Details
|
|
run benchmarks |
|
🤖 |
|
Nice, starts to show up as improvement for normal clickbench as well. |
|
🤖: Benchmark completed Details
|
| // get the value as bytes | ||
| let value: &[u8] = value.as_ref(); | ||
| // Extract length from the view (first 4 bytes of u128 in little-endian) | ||
| let len = (view_u128 & 0xFFFFFFFF) as u32; |
There was a problem hiding this comment.
view_u128 as u32 does the same thing
| } | ||
|
|
||
| // For larger strings: first compare the 4-byte prefix | ||
| let stored_prefix = ((header.view >> 32) & 0xFFFFFFFF) as u32; |
There was a problem hiding this comment.
& 0xFFFFFFFF is not needed
|
|
||
| // For larger strings: first compare the 4-byte prefix | ||
| let stored_prefix = ((header.view >> 32) & 0xFFFFFFFF) as u32; | ||
| let input_prefix = ((view_u128 >> 32) & 0xFFFFFFFF) as u32; |
There was a problem hiding this comment.
& 0xFFFFFFFF is not needed
Dandandan
left a comment
There was a problem hiding this comment.
Two small suggestions, but I think we're good to go then.
| let meta_fetch_concurrency = | ||
| ctx.config_options().execution.meta_fetch_concurrency; | ||
| let file_list = stream::iter(file_list).flatten_unordered(meta_fetch_concurrency); | ||
| let file_list = stream::iter(self.table_paths.iter()) |
There was a problem hiding this comment.
Did you commit this by accident?
There was a problem hiding this comment.
Yes, you're right - this was committed by accident. I've reverted the table.rs changes in commit 9302c3b. These changes belong in a separate PR for parallelizing list_files_for_scan. Thanks for catching it
|
Thank you @Tushar7012 ! |
Closes #19961
Which issue does this PR close?
Closes #19961
Rationale for this change
The ArrowBytesViewMap was using
values.iter()which creates unnecessaryOptionwrappers and extra overhead when iterating over byte view arrays. For ClickBench query 5, >50% CPU was spent during the intern operation.This PR optimizes the hot path in insert_if_new_inner by using direct view access methods that avoid the iteration overhead.
What changes are included in this PR?
values.iter()withvalues.views(): Access the raw&[u128]view buffer directly instead of creating Option wrappers for each valueis_valid(i)for null checking: Check validity via the bitmap instead of pattern matching onOptionvalues.value(i)only when the value is needed, avoiding unnecessary dereferencingAre these changes tested?
Yes, all existing tests pass:
binary_view_map::tests::string_view_set_emptybinary_view_map::tests::string_view_set_one_nullbinary_view_map::tests::string_view_set_many_nullbinary_view_map::tests::test_string_view_set_basicbinary_view_map::tests::test_string_set_non_utf8binary_view_map::tests::test_binary_setbinary_view_map::tests::test_mapbinary_view_map::tests::test_string_set_memory_usageAre there any user-facing changes?
No, this is an internal performance optimization with no changes to public APIs.